Add a HybridReader for use in write constrained databases #423

Open
jimhester wants to merge 5 commits into posit-dev:main from jimhester:pr2-hybrid-reader

jimhester commented May 1, 2026

Summary

Adds HybridReader, a Reader that composes any primary reader (the
"data" side) with an in-process DuckDBReader (the "staging" side).
register() writes go to staging; execute_sql routes queries that
mention any registered name to staging, and everything else to the
primary.

Behind the existing duckdb feature flag — no new feature, no new
dependencies.

Companion design comment with the broader sequencing context:
#341 (comment). Related to #422, but implemented separately.

Motivation

Some data sources are read-only by nature (Flight SQL servers, anonymous
Trino) or expensive to write to repeatedly during visualization
iteration (Snowflake). HybridReader composes a primary reader (the
remote data source) with a local DuckDB instance (staging). register()
writes to staging — sidestepping read-only or auth restrictions — while
execute_sql routes queries to the right side based on which tables
they reference. Same Reader interface; no caller-visible difference.

The design also pairs naturally with a query-result cache (PR3) that
memoizes remote query results in the staging DuckDB. The cache isn't in
this PR, but the staging plumbing it relies on is.

Design

HybridReader owns:

  • data: Box<dyn Reader + Send> — the primary backend.
  • staging: DuckDBReader — an in-process DuckDB instance.
  • staged_names: RefCell<HashSet<String>> — the names register() has
    put into staging.

The routing predicate references_staged_name is a lightweight SQL
scanner — not a full parser. It checks whether any registered name
appears as a SQL identifier (with identifier-boundary respect, qualified
references like catalog.schema.name, double-quoted identifiers, and
single-quoted-string-literal exclusion). Comments are not currently
parsed: a stray identifier inside a -- comment could route a
primary-data query to staging, where it would fail with a clear error
rather than succeeding against the primary backend.
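
As a rough illustration of the scanning rules above, here is a minimal sketch of such a predicate. It is not the code in this PR: the signature, the exact-case matching, and the treatment of digits are assumptions made for the example. Like the real scanner, it does not parse comments.

```rust
use std::collections::HashSet;

/// Illustrative sketch only (assumed signature, exact-case matching).
/// Returns true if `sql` mentions any name in `staged` as an identifier.
fn references_staged_name(sql: &str, staged: &HashSet<String>) -> bool {
    if staged.is_empty() {
        return false;
    }
    let chars: Vec<char> = sql.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '\'' {
            // Skip a single-quoted string literal; '' is an escaped quote.
            i += 1;
            while i < chars.len() {
                if chars[i] == '\'' {
                    if i + 1 < chars.len() && chars[i + 1] == '\'' {
                        i += 2;
                        continue;
                    }
                    break;
                }
                i += 1;
            }
            i += 1; // step past the closing quote
        } else if c == '"' {
            // Double-quoted identifier: compare its contents exactly.
            let start = i + 1;
            let mut end = start;
            while end < chars.len() && chars[end] != '"' {
                end += 1;
            }
            let ident: String = chars[start..end].iter().collect();
            if staged.contains(&ident) {
                return true;
            }
            i = end + 1;
        } else if c.is_alphanumeric() || c == '_' {
            // Bare word: consume it whole so `orders` never matches inside
            // `orders_detail`. Dots end a word, so each part of a qualified
            // reference such as catalog.schema.orders is checked separately.
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let ident: String = chars[start..i].iter().collect();
            if staged.contains(&ident) {
                return true;
            }
        } else {
            i += 1;
        }
    }
    false
}
```

With `orders` registered, this sketch returns true for `SELECT * FROM catalog.schema.orders` and false for `SELECT 'orders' FROM t`. It also returns true for `SELECT 1 FROM t -- orders`, which is the comment false positive described above.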

Reader::dialect() returns the staging DuckDB dialect, because all
internally-generated SQL (stat transforms, layer filters, temp-table
DDL) targets staging. Callers that need the primary's dialect (e.g.
schema introspection of the remote catalog) get it via the inherent
HybridReader::data_dialect() method.
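
To make the composition concrete, the following is a toy model of the ownership and routing described in this section. Everything in it is hypothetical: the `Reader` trait is cut down to string-based methods, `LabelReader` stands in for both backends (a `DuckDBReader` on the staging side in the PR), and `mentions_staged` is a crude word split rather than the real scanner.

```rust
use std::cell::RefCell;
use std::collections::HashSet;

// Hypothetical, heavily simplified stand-in for the crate's Reader trait.
trait Reader {
    fn register(&self, name: &str) -> Result<(), String>;
    fn unregister(&self, name: &str) -> Result<(), String>;
    fn execute_sql(&self, sql: &str) -> Result<String, String>;
    fn dialect(&self) -> &'static str;
}

// Toy backend that tags each query with its own label instead of running it.
struct LabelReader {
    label: &'static str,
    dialect: &'static str,
}

impl Reader for LabelReader {
    fn register(&self, _name: &str) -> Result<(), String> { Ok(()) }
    fn unregister(&self, _name: &str) -> Result<(), String> { Ok(()) }
    fn execute_sql(&self, sql: &str) -> Result<String, String> {
        Ok(format!("{}: {}", self.label, sql))
    }
    fn dialect(&self) -> &'static str { self.dialect }
}

struct HybridSketch {
    data: Box<dyn Reader + Send>,           // the primary backend
    staging: LabelReader,                   // a DuckDBReader in the PR
    staged_names: RefCell<HashSet<String>>, // names register() has staged
}

impl HybridSketch {
    fn new(data: Box<dyn Reader + Send>) -> Self {
        HybridSketch {
            data,
            staging: LabelReader { label: "staging", dialect: "duckdb" },
            staged_names: RefCell::new(HashSet::new()),
        }
    }

    // Inherent accessor for the primary's dialect.
    fn data_dialect(&self) -> &'static str { self.data.dialect() }

    // Placeholder for the routing predicate: exact match on whole words.
    fn mentions_staged(&self, sql: &str) -> bool {
        let names = self.staged_names.borrow();
        let hit = sql
            .split(|c: char| !(c.is_alphanumeric() || c == '_'))
            .any(|word| names.contains(word));
        hit
    }
}

impl Reader for HybridSketch {
    fn register(&self, name: &str) -> Result<(), String> {
        self.staging.register(name)?;
        self.staged_names.borrow_mut().insert(name.to_string());
        Ok(())
    }
    fn unregister(&self, name: &str) -> Result<(), String> {
        self.staging.unregister(name)?;
        self.staged_names.borrow_mut().remove(name);
        Ok(())
    }
    fn execute_sql(&self, sql: &str) -> Result<String, String> {
        // Queries are dispatched whole to exactly one side.
        if self.mentions_staged(sql) {
            self.staging.execute_sql(sql)
        } else {
            self.data.execute_sql(sql)
        }
    }
    // Generated SQL targets staging, so the trait method reports its dialect.
    fn dialect(&self) -> &'static str { self.staging.dialect() }
}
```

In this model, `register("t")` flips `SELECT * FROM t` from the data side to staging, while `dialect()` keeps reporting the staging dialect whatever the primary is.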

Limitations (documented)

A single SQL statement cannot reference both staged names and
primary-data tables. Queries are dispatched whole; cross-backend joins
are unsupported. Materialize one side into staging first if you need to
combine them. There is a regression test pinning this behavior.

Staged data lives in the in-process DuckDB instance and is released
when the HybridReader is dropped — no spill-to-disk, no shared cache.

Testing

All tests are offline, no external setup:

  • Routing scanner (9 tests): empty registered-name set, no match,
    single match, rejection of longer-identifier overlap (orders should
    not match orders_detail), rejection of identifier-prefix overlap
    (col should not match col_id), rejection of single-quoted-string
    contents, match of double-quoted identifiers, match of qualified
    references (catalog.schema.orders), and SQL-standard '' escape
    inside a string literal.
  • Reader behavior (5 tests): register delegates to staging and
    tracks the name; execute_sql routes a registered name to staging;
    execute_sql routes an unregistered name to data; unregister
    delegates and untracks; dialect() returns the staging dialect,
    checked against a SQLite data side so the two dialects differ.
  • Cross-side limitation (1 test): a query referencing both staged
    and primary-only names routes wholly to staging and surfaces a
    staging-side error rather than silently joining. The setup
    discriminates correct routing from a wrong-route that would
    otherwise succeed.
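
The idea behind the discriminating setup in the cross-side test can be modelled in a few lines. This is a toy sketch, not the PR's test: `ToyBackend`, `route`, and the marker values are invented for illustration, and a "query" is reduced to the list of table names it references.

```rust
use std::collections::{HashMap, HashSet};

// Toy backend: each table name maps to a single marker value.
struct ToyBackend {
    tables: HashMap<String, i64>,
}

impl ToyBackend {
    fn with(entries: &[(&str, i64)]) -> Self {
        let mut tables = HashMap::new();
        for (name, value) in entries {
            tables.insert(name.to_string(), *value);
        }
        ToyBackend { tables }
    }

    // "Runs" a query by summing the marker of every referenced table and
    // fails if any referenced table is missing from this backend.
    fn run(&self, referenced: &[&str]) -> Result<i64, String> {
        let mut total = 0;
        for name in referenced {
            match self.tables.get(*name) {
                Some(value) => total += value,
                None => return Err(format!("table not found: {}", name)),
            }
        }
        Ok(total)
    }
}

// Whole-query dispatch: any staged name sends the query to staging.
fn route<'a>(
    referenced: &[&str],
    staged: &HashSet<String>,
    data: &'a ToyBackend,
    staging: &'a ToyBackend,
) -> &'a ToyBackend {
    if referenced.iter().any(|name| staged.contains(*name)) {
        staging
    } else {
        data
    }
}
```

If `staged_only` exists on both sides with different values and `primary_only` exists only on the data side, a query that references both must fail on staging. A wrong route to the data side would instead succeed, which is what lets the test tell the two outcomes apart.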

The dialect-discrimination test uses a SqliteReader for the data
side (Ansi CASE-form sql_greatest) against a DuckDB staging
(GREATEST(a, b)), so a regression that returned the data dialect
instead of staging's would fail visibly. Gated on the sqlite feature,
which is in upstream's default feature set.
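
A minimal sketch of why the two dialects are distinguishable through sql_greatest follows. The trait shape and the exact text of the CASE form are assumptions; only the contrast between a CASE-form fallback and GREATEST(a, b) comes from the description above.

```rust
// Hypothetical, simplified dialect trait; the real trait and its method
// signatures are not shown in this PR description.
trait SqlDialect {
    // Default: a portable CASE form (assumed shape of the ANSI fallback).
    fn sql_greatest(&self, a: &str, b: &str) -> String {
        format!("CASE WHEN {a} >= {b} THEN {a} ELSE {b} END", a = a, b = b)
    }
}

// Stand-in for the data side's dialect, which keeps the default.
struct AnsiLike;
impl SqlDialect for AnsiLike {}

// Stand-in for the staging DuckDB dialect, which overrides the default.
struct DuckDbLike;
impl SqlDialect for DuckDbLike {
    fn sql_greatest(&self, a: &str, b: &str) -> String {
        format!("GREATEST({}, {})", a, b)
    }
}
```

Because the two outputs differ textually, an assertion on the generated SQL fails as soon as dialect() hands back the data side's dialect.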

What's next

Per the design comment, a follow-up PR adds:

  • PR3: A query-result cache in the staging DuckDB
    (hybrid_cache.rs), a Reader::clear_cache() trait default,
    Vega-Lite v5+v6 mime emission in the Jupyter kernel, and the
    -- @uncache Jupyter meta-command.

The cache makes the iterate-on-remote case sub-millisecond on cache
hits while keeping the same Reader interface; it's gated by an env
var and fronted by a public CacheConfig for callers that want to
tune TTL or the byte budget.

jimhester added 5 commits May 1, 2026 16:01
Wraps any Reader (the data side) with an in-process DuckDBReader (the
staging side). register() writes to staging; execute_sql routes whole
queries to staging or the primary based on whether they reference any
registered name. Behind the existing 'duckdb' feature.

Tests cover the routing scanner (identifier-boundary checks, qualified
references, double-quoted identifiers, string-literal exclusion),
register/unregister name tracking, dialect dispatch, and the documented
cross-side limitation.

Per code review: the original tests for routing direction and dialect
selection used identical setup on both sides, so they passed regardless
of impl correctness. The dialect test now uses a SqliteReader on the
data side (SQLite dialect) so the staging-vs-data distinction surfaces
in sql_greatest output, and the cross-side test now registers
staged_only in both data and staging with different values so a
wrong-route would succeed silently rather than erroring for the same
reason as the correct route. Also corrects an inverted "false-negative"
label and softens the misleading "comments are harmless" note in the
references_staged_name doc-comment.